The idea behind this notebook is to explore the many ways in which public transport (especially MRT systems) influences and shapes cities. This will be done using data downloaded from various sites across the internet.
We will begin by importing the needed packages.
#import necessary libraries
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import seaborn as sns
Now we will load the data into a single dataframe. The figures come from different sites; each source is linked below.
#load and clean up data
area = pd.read_csv("area.csv") #source: https://www.worldometers.info/population/largest-cities-in-the-world/
population = pd.read_csv("worldcities.csv") #source: https://www.kaggle.com/aziz69/worldcities
population = population.drop(columns=['city_ascii','lat','lng','country','iso2','iso3','admin_name','capital','id'])
population = population[:2000]
data_quality_of_life = pd.read_csv("movehubqualityoflife.csv") #source: https://www.kaggle.com/blitzr/movehub-city-rankings?select=movehubqualityoflife.csv
del data_quality_of_life['Purchase Power']
del data_quality_of_life['Health Care']
data_quality_of_life.columns=('city_name', 'Mh_rating', 'Pollution', 'Quality_of_life', 'Crime Rating')
#GDP is presented in billions of dollars
gdp = pd.read_csv("gdp.csv") #source: https://de.wikipedia.org/wiki/Liste_der_St%C3%A4dte_nach_Bruttoinlandsprodukt
transport_data = pd.read_csv("transport_data.csv") #source: https://en.wikipedia.org/wiki/List_of_metro_systems
transport_data = transport_data.groupby("city_name").sum() #ridership column is annual ridership presented in millions
merged_data = transport_data.merge(gdp,on='city_name')
merged_data = merged_data.merge(data_quality_of_life, on='city_name')
merged_data = merged_data.merge(population, on='city_name')
merged_data.drop([4,23],axis=0,inplace=True)
merged_data = merged_data.reset_index(drop=True)
full_data = merged_data.copy()
#create "per capita" columns
for column_name in list(merged_data.columns):
    if column_name in ["city_name","population","Mh_rating","Pollution","Quality_of_life","Crime Rating"]:
        continue
    full_data[column_name + "_pp"] = full_data[column_name]/full_data["population"] * 1e6 #so "_pp" really means "per million residents"
#Except:
#GDP_pp is thousands of $ per capita
#ridership_pp is actually per capita
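As a quick sanity check on this scaling (a toy frame with hypothetical values, not the notebook's data): 100 stations in a city of 5 million should come out as 20 stations per million residents.

```python
import pandas as pd

# Toy frame mirroring the transformation above (hypothetical values).
toy = pd.DataFrame({"city_name": ["A"], "stations": [100], "population": [5_000_000]})
toy["stations_pp"] = toy["stations"] / toy["population"] * 1e6  # 1e6 == the original 10e5
print(toy["stations_pp"].iloc[0])  # 20.0
```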
#create per square kilometer columns
area_data = full_data.merge(area,on="city_name")
for column_name in list(area_data.columns):
    if column_name in ["city_name","population","Mh_rating","Pollution","Quality_of_life","Crime Rating","area","density"] or column_name.endswith("_pp"):
        continue
    area_data[column_name + "_pkm"] = area_data[column_name]/area_data["area"]
#add congestion data
file_input = open("congestion.txt", "r", encoding='utf8') #source: https://www.tomtom.com/en_gb/traffic-index/ranking/
congestion_dict = {} #renamed to avoid shadowing the built-in dict
for line in file_input:
    for city in list(full_data['city_name']):
        if line.find(city) != -1:
            index = line.find("%")
            congestion_dict[city] = int(line[index-2:index]) #assumes a two-digit percentage
file_input.close()
df = pd.DataFrame(congestion_dict, index=[0])
df = df.transpose()
df = df.reset_index()
df.columns = ['city_name','congestion ratio']
con_data = area_data.merge(df, on='city_name')
con_data
| | city_name | stations | system_length | ridership | GDP | Mh_rating | Pollution | Quality_of_life | Crime Rating | population | ... | system_length_pp | ridership_pp | GDP_pp | area | density | stations_pkm | system_length_pkm | ridership_pkm | GDP_pkm | congestion ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Atlanta | 38 | 76.6 | 24.10 | 294.420 | 84.92 | 63.09 | 80.51 | 56.04 | 5449398.0 | ... | 14.056599 | 4.422507 | 54.027986 | 6851 | 700 | 0.005547 | 0.011181 | 0.003518 | 0.042975 | 26 |
| 1 | Bangkok | 113 | 139.2 | 332.20 | 306.765 | 74.21 | 60.39 | 37.54 | 36.10 | 17066000.0 | ... | 8.156569 | 19.465604 | 17.975214 | 2590 | 5800 | 0.043629 | 0.053745 | 0.128263 | 0.118442 | 53 |
| 2 | Barcelona | 159 | 122.3 | 217.93 | 171.032 | 83.37 | 71.75 | 47.18 | 41.47 | 4588000.0 | ... | 26.656495 | 47.500000 | 37.278117 | 1075 | 4400 | 0.147907 | 0.113767 | 0.202726 | 0.159100 | 29 |
| 3 | Beijing | 342 | 727.0 | 2292.65 | 506.137 | 74.17 | 57.18 | 25.69 | 47.61 | 19433000.0 | ... | 37.410590 | 117.977152 | 26.045232 | 3820 | 5500 | 0.089529 | 0.190314 | 0.600170 | 0.132497 | 37 |
| 4 | Belo Horizonte | 19 | 28.1 | 58.40 | 84.686 | 71.79 | 59.73 | 36.26 | 50.99 | 5159000.0 | ... | 5.446792 | 11.320023 | 16.415197 | 1088 | 4200 | 0.017463 | 0.025827 | 0.053676 | 0.077836 | 35 |
| 5 | Boston | 51 | 61.0 | 57.50 | 360.110 | 82.20 | 61.77 | 82.81 | 40.69 | 4688346.0 | ... | 13.010985 | 12.264453 | 76.809604 | 5325 | 800 | 0.009577 | 0.011455 | 0.010798 | 0.067626 | 26 |
| 6 | Buenos Aires | 90 | 56.7 | 321.30 | 315.885 | 77.29 | 47.26 | 39.78 | 59.46 | 16157000.0 | ... | 3.509315 | 19.886117 | 19.550969 | 2681 | 5300 | 0.033570 | 0.021149 | 0.119843 | 0.117824 | 35 |
| 7 | Chicago | 145 | 165.4 | 76.00 | 563.188 | 83.01 | 22.98 | 80.71 | 54.52 | 8604203.0 | ... | 19.223163 | 8.832892 | 65.454988 | 6856 | 1300 | 0.021149 | 0.024125 | 0.011085 | 0.082145 | 28 |
| 8 | Delhi | 229 | 347.6 | 926.10 | 378.449 | 77.12 | 73.03 | 42.45 | 52.56 | 29617000.0 | ... | 11.736503 | 31.269203 | 12.778100 | 2072 | 12100 | 0.110521 | 0.167761 | 0.446959 | 0.182649 | 56 |
| 9 | Dubai | 56 | 89.5 | 113.60 | 174.919 | 98.44 | 30.81 | 85.16 | 19.36 | 2502715.0 | ... | 35.761163 | 45.390706 | 69.891698 | 1347 | 2900 | 0.041574 | 0.066444 | 0.084336 | 0.129858 | 21 |
| 10 | Istanbul | 107 | 135.7 | 262.30 | 348.721 | 80.22 | 44.98 | 45.67 | 42.36 | 15154000.0 | ... | 8.954731 | 17.308961 | 23.011812 | 1360 | 9800 | 0.078676 | 0.099779 | 0.192868 | 0.256412 | 55 |
| 11 | Jakarta | 13 | 15.7 | 9.90 | 321.315 | 69.15 | 57.91 | 15.66 | 48.27 | 34540000.0 | ... | 0.454545 | 0.286624 | 9.302693 | 3225 | 9500 | 0.004031 | 0.004868 | 0.003070 | 0.099633 | 53 |
| 12 | Kuala Lumpur | 104 | 142.5 | 113.20 | 171.772 | 77.88 | 69.65 | 42.40 | 68.91 | 8285000.0 | ... | 17.199759 | 13.663247 | 20.732891 | 1943 | 3600 | 0.053525 | 0.073340 | 0.058260 | 0.088406 | 37 |
| 13 | Lima | 26 | 34.6 | 110.40 | 176.447 | 73.19 | 58.49 | 22.67 | 65.31 | 9848000.0 | ... | 3.513404 | 11.210398 | 17.917039 | 919 | 11700 | 0.028292 | 0.037650 | 0.120131 | 0.191999 | 57 |
| 14 | London | 315 | 436.0 | 1453.80 | 835.658 | 100.00 | 57.95 | 47.89 | 50.14 | 10979000.0 | ... | 39.712178 | 132.416431 | 76.114218 | 1738 | 5900 | 0.181243 | 0.250863 | 0.836479 | 0.480816 | 22 |
| 15 | Los Angeles | 16 | 28.0 | 22.80 | 860.452 | 86.86 | 75.20 | 62.82 | 65.74 | 12750807.0 | ... | 2.195939 | 1.788122 | 67.482160 | 6299 | 2400 | 0.002540 | 0.004445 | 0.003620 | 0.136601 | 42 |
| 16 | Madrid | 242 | 288.5 | 677.47 | 262.335 | 85.37 | 55.77 | 59.87 | 39.34 | 6026000.0 | ... | 47.875871 | 112.424494 | 43.533853 | 1321 | 4700 | 0.183195 | 0.218395 | 0.512846 | 0.198588 | 23 |
| 17 | Manila | 46 | 54.1 | 315.10 | 182.842 | 68.31 | 65.22 | 13.14 | 66.36 | 23088000.0 | ... | 2.343209 | 13.647782 | 7.919352 | 1580 | 15300 | 0.029114 | 0.034241 | 0.199430 | 0.115723 | 71 |
| 18 | Mexico City | 163 | 200.9 | 1655.40 | 403.561 | 75.07 | 18.95 | 27.91 | 77.86 | 20996000.0 | ... | 9.568489 | 78.843589 | 19.220852 | 2072 | 9700 | 0.078668 | 0.096959 | 0.798938 | 0.194769 | 52 |
| 19 | Miami | 23 | 40.1 | 9.60 | 262.697 | 84.43 | 22.45 | 74.77 | 64.85 | 6445545.0 | ... | 6.221351 | 1.489401 | 40.756367 | 3209 | 1800 | 0.007167 | 0.012496 | 0.002992 | 0.081863 | 31 |
| 20 | Moscow | 241 | 412.1 | 2560.70 | 553.318 | 76.35 | 30.03 | 18.65 | 53.19 | 17125000.0 | ... | 24.064234 | 149.529927 | 32.310540 | 4662 | 3500 | 0.051695 | 0.088396 | 0.549271 | 0.118687 | 59 |
| 21 | Mumbai | 12 | 11.4 | 118.00 | 221.192 | 77.13 | 70.49 | 26.43 | 34.63 | 23355000.0 | ... | 0.488118 | 5.052451 | 9.470863 | 546 | 32400 | 0.021978 | 0.020879 | 0.216117 | 0.405114 | 65 |
| 22 | New York | 458 | 443.7 | 1730.20 | 1403.463 | 90.53 | 12.01 | 61.94 | 47.33 | 18713220.0 | ... | 23.710511 | 92.458700 | 74.998477 | 11642 | 1800 | 0.039340 | 0.038112 | 0.148617 | 0.120552 | 37 |
| 23 | Paris | 304 | 219.9 | 1497.70 | 715.080 | 95.92 | 59.18 | 52.62 | 53.75 | 11020000.0 | ... | 19.954628 | 135.907441 | 64.889292 | 2845 | 3800 | 0.106854 | 0.077293 | 0.526432 | 0.251346 | 39 |
| 24 | Philadelphia | 88 | 82.0 | 41.60 | 346.455 | 83.31 | 38.64 | 65.53 | 68.58 | 5649300.0 | ... | 14.515073 | 7.363744 | 61.327067 | 5131 | 1100 | 0.017151 | 0.015981 | 0.008108 | 0.067522 | 24 |
| 25 | Shanghai | 369 | 743.0 | 2834.69 | 594.005 | 75.69 | 61.74 | 31.66 | 16.51 | 22120000.0 | ... | 33.589512 | 128.150542 | 26.853752 | 3820 | 6100 | 0.096597 | 0.194503 | 0.742065 | 0.155499 | 31 |
| 26 | Sydney | 13 | 36.0 | 12.90 | 223.413 | 94.53 | 18.48 | 74.32 | 32.80 | 5312163.0 | ... | 6.776900 | 2.428389 | 42.056880 | 2037 | 2000 | 0.006382 | 0.017673 | 0.006333 | 0.109677 | 33 |
| 27 | Tokyo | 249 | 316.3 | 4027.30 | 1616.792 | 80.00 | 30.54 | 69.29 | 13.91 | 37977000.0 | ... | 8.328725 | 106.045765 | 42.572926 | 8547 | 4400 | 0.029133 | 0.037007 | 0.471195 | 0.189165 | 42 |
| 28 | Toronto | 75 | 76.5 | 166.00 | 276.313 | 88.42 | 8.95 | 77.02 | 27.40 | 5429524.0 | ... | 14.089633 | 30.573583 | 50.890833 | 2287 | 2800 | 0.032794 | 0.033450 | 0.072584 | 0.120819 | 33 |
| 29 | Washington | 91 | 188.0 | 68.10 | 442.212 | 81.74 | 30.55 | 76.92 | 51.43 | 5379184.0 | ... | 34.949539 | 12.659913 | 82.208008 | 3424 | 1400 | 0.026577 | 0.054907 | 0.019889 | 0.129151 | 29 |
30 rows × 21 columns
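A side note on the congestion parsing above: `int(line[index-2:index])` assumes every percentage has exactly two digits, so a value like 9% or 103% would be misread. A regex-based sketch is more robust (the lines below are hypothetical stand-ins for congestion.txt, not real TomTom data):

```python
import re

# Hypothetical lines in the format the parser above assumes.
lines = ["Manila 71%", "Riga 9%", "Moscow 59%"]
cities = ["Manila", "Riga", "Moscow"]
congestion = {}
for line in lines:
    match = re.search(r"(\d{1,3})\s*%", line)  # accepts 1-3 digit percentages
    for city in cities:
        if city in line and match:
            congestion[city] = int(match.group(1))
print(congestion)  # {'Manila': 71, 'Riga': 9, 'Moscow': 59}
```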
The dataframe above contains data on 30 cities across the world, all of which will be analysed to look for trends, tendencies, correlations and causal relationships between public transport and popular indicators.
Here you can see all columns of the dataset. Note that "_pp" doesn't necessarily mean "per capita" - see the code above.
con_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 0 to 29
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   city_name          30 non-null     object
 1   stations           30 non-null     int64
 2   system_length      30 non-null     float64
 3   ridership          30 non-null     float64
 4   GDP                30 non-null     float64
 5   Mh_rating          30 non-null     float64
 6   Pollution          30 non-null     float64
 7   Quality_of_life    30 non-null     float64
 8   Crime Rating       30 non-null     float64
 9   population         30 non-null     float64
 10  stations_pp        30 non-null     float64
 11  system_length_pp   30 non-null     float64
 12  ridership_pp       30 non-null     float64
 13  GDP_pp             30 non-null     float64
 14  area               30 non-null     int64
 15  density            30 non-null     int64
 16  stations_pkm       30 non-null     float64
 17  system_length_pkm  30 non-null     float64
 18  ridership_pkm      30 non-null     float64
 19  GDP_pkm            30 non-null     float64
 20  congestion ratio   30 non-null     int64
dtypes: float64(16), int64(4), object(1)
memory usage: 5.2+ KB
#create auxiliary lists
transport_columns = ["stations","system_length","ridership","stations_pp","system_length_pp","ridership_pp","stations_pkm","system_length_pkm","ridership_pkm"]
other_data_columns = list(con_data.columns)
for x in transport_columns:
    other_data_columns.remove(x)
other_data_columns.remove("city_name")
transport_columns_boolean = []
other_data_columns_boolean = []
for x in list(con_data.columns):
    transport_columns_boolean.append(x in transport_columns)
for x in transport_columns_boolean:
    other_data_columns_boolean.append(not x)
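To illustrate what these boolean lists are for: pandas accepts a boolean list in the column position of `.iloc` and keeps the columns marked `True`. A toy sketch (hypothetical frame):

```python
import pandas as pd

demo_df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
keep = ["a", "c"]
mask = [col in keep for col in demo_df.columns]  # same pattern as transport_columns_boolean
print(list(demo_df.iloc[:, mask].columns))  # ['a', 'c']
```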
The dataframe above is too large to analyse at a glance, so let's begin by calculating the correlation between the transport columns and the other columns.
#correlation matrix
con_data_corr = con_data.corr()
con_data_corr.drop(other_data_columns,axis=0,inplace=True)
con_data_corr.drop(transport_columns,axis=1,inplace=True)
con_data_corr.style.background_gradient(cmap="PiYG",vmin=-1,vmax=1,axis=None)
| GDP | Mh_rating | Pollution | Quality_of_life | Crime Rating | population | GDP_pp | area | density | GDP_pkm | congestion ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| stations | 0.634825 | 0.198068 | -0.100830 | -0.122691 | -0.213609 | 0.294118 | 0.197223 | 0.378515 | -0.156165 | 0.283874 | -0.138511 |
| system_length | 0.472807 | 0.011266 | -0.016827 | -0.214334 | -0.270638 | 0.321947 | 0.058839 | 0.263589 | -0.109424 | 0.207464 | -0.111745 |
| ridership | 0.708002 | -0.045004 | -0.152776 | -0.228795 | -0.330004 | 0.568217 | -0.024364 | 0.376605 | -0.035747 | 0.236010 | 0.095536 |
| stations_pp | 0.168050 | 0.518279 | -0.050774 | 0.237838 | -0.204840 | -0.351706 | 0.489747 | 0.029194 | -0.340490 | 0.219919 | -0.563939 |
| system_length_pp | 0.139044 | 0.421229 | -0.078449 | 0.211041 | -0.286738 | -0.299898 | 0.465743 | 0.020168 | -0.345712 | 0.180790 | -0.575167 |
| ridership_pp | 0.516346 | 0.224656 | -0.092206 | -0.206100 | -0.263502 | 0.233698 | 0.149506 | 0.155351 | -0.114917 | 0.385131 | -0.099894 |
| stations_pkm | 0.062807 | 0.236328 | 0.259432 | -0.202732 | -0.120055 | -0.000069 | 0.005256 | -0.330721 | 0.052994 | 0.564461 | -0.169206 |
| system_length_pkm | 0.083304 | 0.127323 | 0.235056 | -0.254031 | -0.180030 | 0.095457 | -0.041234 | -0.280599 | 0.052393 | 0.508909 | -0.152991 |
| ridership_pkm | 0.331981 | 0.018147 | 0.075924 | -0.411183 | -0.115928 | 0.390171 | -0.130161 | -0.115400 | 0.188826 | 0.573773 | 0.107821 |
As you can see, we get some interesting results. First, GDP seems to be correlated with the number of stations, the length of the system and ridership. The main reason is likely that bigger cities have larger populations, which means both higher GDP and a bigger public transport network - and especially higher ridership. So the underlying cause would probably be population. Indeed, these transport columns have noticeable correlations with population, but except for ridership those coefficients are clearly smaller than the ones with GDP. So even though both variables are influenced by population, we can suspect that the size of the MRT system influences GDP (or the other way around).
Ridership per capita also seems to be connected with GDP, which suggests that more people using public transport makes the economy grow. This does not necessarily imply that if we all switched to the underground the economy would suddenly grow - it likely wouldn't. More plausibly it reflects lower-income residents who cannot afford cars: they either use public transport to move around the city, or they simply visit the city less often because getting there is harder. (The population column covers metropolitan areas, not just urban cores, so people living in the vicinity of a city are counted here, which makes this explanation plausible.)
We can also see that cities with more stations and longer systems per capita are rated as "better" (at least by the Movehub website). Unsurprisingly, cities whose stations are less overcrowded and whose transport systems are more accessible tend to be rated higher.
Pollution, however, doesn't really seem to be correlated with public transport, which may be surprising. Cities with denser networks do show slightly higher pollution, but the coefficient isn't large.
Quality of life (which is measured from healthcare, pollution, purchase power and crime rate) actually seems to decrease with denser and larger networks, unless we consider only the per-capita columns.
Other interesting coefficients are between stations per capita / system length per capita and GDP per capita, density and congestion ratio. It would seem that a better MRT system significantly contributes to economic growth, somewhat decreases density (as people can live further from their workplace) and plays a very important role in fighting traffic congestion. What is surprising is that ridership per capita isn't really correlated with these values. This could mean that cities benefit more from the opportunities public transport provides than from the actual number of MRT users.
Lastly, we notice high coefficients between GDP per square kilometer and both public transport availability and ridership per square kilometer. This could mean that cities with denser MRT systems use their land better, making more money out of the same amount of land.
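One way to probe the population-confounding argument above is a partial correlation: correlate two variables after regressing population out of both. This is a sketch on synthetic data (not the notebook's frame), where GDP and stations are both driven purely by population, so the raw correlation is high while the partial correlation is near zero:

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing z out of both."""
    design = np.column_stack([np.ones_like(z), z])
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

# Synthetic city data where GDP and stations depend only on population.
rng = np.random.default_rng(0)
sim_pop = rng.uniform(1.0, 30.0, 200)
sim_gdp = 50.0 * sim_pop + rng.normal(0, 5, 200)
sim_stations = 10.0 * sim_pop + rng.normal(0, 5, 200)
raw = np.corrcoef(sim_gdp, sim_stations)[0, 1]
partial = partial_corr(sim_gdp, sim_stations, sim_pop)
print(raw > 0.95, abs(partial) < 0.3)  # raw near 1, partial near 0
```

Applying this to the real columns would show how much of the stations-GDP coefficient survives once population is controlled for.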
Now let's draw some plots to better understand the relationships between those columns and to verify what we have assumed so far.
#combined plot
sorting_criteria = "GDP_pp"
standardised = con_data.copy()
std = con_data.describe().transpose()['std']
mean = con_data.describe().transpose()['mean']
for column in list(con_data.columns):
    if column == "city_name":
        continue
    standardised[column] = (standardised[column] - mean[column]) / std[column]
standardised = standardised.sort_values(sorting_criteria, ascending=False).reset_index(drop=True)
standardised.plot(figsize=(30,15))
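The standardisation loop above can also be written as one vectorised expression. A toy sketch (hypothetical frame) verifying that the result has zero mean and unit sample standard deviation per column:

```python
import pandas as pd

toy_df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [10.0, 20.0, 15.0, 5.0]})
z = (toy_df - toy_df.mean()) / toy_df.std()  # same transformation as the loop, all columns at once
print(z.mean().abs().max() < 1e-9, z.std().round(12).tolist())  # True [1.0, 1.0]
```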
We can tell little from the plot above, so let's use subplots instead.
#sorted subplots
for sorting_column in ["GDP", "GDP_pp", "Pollution","congestion ratio"]: #plot transport columns with cities sorted in descending order by the specified criterion
    standardised = standardised.sort_values(sorting_column, ascending=False).reset_index(drop=True) #data is standardised - positive values mean above-average performance and vice versa
    standardised.iloc[:,transport_columns_boolean].plot(subplots=True, figsize=(15,30), title=sorting_column)
The first figure confirms that the correlation between stations, system length, ridership and GDP exists, but it isn't very consistent: there are cities with low values in these transport columns that still have high GDP (Los Angeles and Chicago, for example). The other transport columns also seem slightly correlated with GDP, as the values in the subplots slowly decrease. There are outliers, though: Madrid seems to have a great transport system but ranks only 21st in GDP, and Barcelona ranks 28th in GDP yet also has a good transport system.
There is definitely a trend here, but causation remains unclear.
When cities are sorted by GDP per capita the picture looks mostly random, except for stations per capita and system length per capita, which show a slight correlation: all cities with clearly above-average values in these indicators rank in the top 20 of these 30 cities. Several cities made the top 20 despite low values (Los Angeles, Atlanta, Toronto, Tokyo, Sydney, Miami). Seeing US cities again is no surprise, and Sydney has so far built only 1 of its 4 planned MRT lines. The presence of Tokyo, however, is surprising, as it has a huge transport network - but of these cities only Tokyo has very high ridership per capita, so its system appears better adjusted, with each station and line carrying more passengers than in other cities.
Tokyo in fact has an immense network of trains serving its metropolitan area, but the data used here excludes lines operated by the Japan Railways Group (only the Toei Subway, Tokyo Metro and the Rinkai Line are counted). Those lines operate mostly within Tokyo city proper, while the population figures refer to the metropolitan area. This is probably why Tokyo's per-capita values are so low: the Tokyo metropolitan area is the most populated in the world, with almost 40 million inhabitants, while the city proper has fewer than 10 million.
Pollution seems to be completely random here.
There does seem to be a trend in congestion: cities with a high number of stations per capita and high system length per capita almost always experience less traffic. Moscow is an outlier, struggling with traffic despite slightly above-average transport values, and some cities with only average transport networks still rank among the least congested.
#cities in order from upper subplots
for sorting_column in ["GDP", "GDP_pp", "Pollution","congestion ratio"]:
    print("sorted descending by: " + sorting_column)
    print(con_data.sort_values(sorting_column, ascending=False).reset_index()["city_name"])
sorted descending by: GDP 0 Tokyo 1 New York 2 Los Angeles 3 London 4 Paris 5 Shanghai 6 Chicago 7 Moscow 8 Beijing 9 Washington 10 Mexico City 11 Delhi 12 Boston 13 Istanbul 14 Philadelphia 15 Jakarta 16 Buenos Aires 17 Bangkok 18 Atlanta 19 Toronto 20 Miami 21 Madrid 22 Sydney 23 Mumbai 24 Manila 25 Lima 26 Dubai 27 Kuala Lumpur 28 Barcelona 29 Belo Horizonte Name: city_name, dtype: object sorted descending by: GDP_pp 0 Washington 1 Boston 2 London 3 New York 4 Dubai 5 Los Angeles 6 Chicago 7 Paris 8 Philadelphia 9 Atlanta 10 Toronto 11 Madrid 12 Tokyo 13 Sydney 14 Miami 15 Barcelona 16 Moscow 17 Shanghai 18 Beijing 19 Istanbul 20 Kuala Lumpur 21 Buenos Aires 22 Mexico City 23 Bangkok 24 Lima 25 Belo Horizonte 26 Delhi 27 Mumbai 28 Jakarta 29 Manila Name: city_name, dtype: object sorted descending by: Pollution 0 Los Angeles 1 Delhi 2 Barcelona 3 Mumbai 4 Kuala Lumpur 5 Manila 6 Atlanta 7 Boston 8 Shanghai 9 Bangkok 10 Belo Horizonte 11 Paris 12 Lima 13 London 14 Jakarta 15 Beijing 16 Madrid 17 Buenos Aires 18 Istanbul 19 Philadelphia 20 Dubai 21 Washington 22 Tokyo 23 Moscow 24 Chicago 25 Miami 26 Mexico City 27 Sydney 28 New York 29 Toronto Name: city_name, dtype: object sorted descending by: congestion ratio 0 Manila 1 Mumbai 2 Moscow 3 Lima 4 Delhi 5 Istanbul 6 Jakarta 7 Bangkok 8 Mexico City 9 Tokyo 10 Los Angeles 11 Paris 12 Kuala Lumpur 13 Beijing 14 New York 15 Buenos Aires 16 Belo Horizonte 17 Sydney 18 Toronto 19 Shanghai 20 Miami 21 Washington 22 Barcelona 23 Chicago 24 Boston 25 Atlanta 26 Philadelphia 27 Madrid 28 London 29 Dubai Name: city_name, dtype: object
Now we will do the same thing, but sorting cities by the transport columns.
#subplots sorted by transport columns
for sorting_column in ["stations_pp", "system_length_pp", "stations","system_length"]:
    standardised = standardised.sort_values(sorting_column, ascending=False).reset_index(drop=True)
    standardised.iloc[:,other_data_columns_boolean].plot(subplots=True, figsize=(15,30), title=sorting_column)
From the subplots we can once again see the correlation between stations per capita and both GDP per capita and congestion ratio.
What is new is that cities with high density also have a small number of stations per person. This might mean that when people lack public transport they tend to settle closer to the city centre so they can reach it in a reasonable time. MRT systems let inhabitants travel into the city faster from further away, so well-built systems can reduce the overcrowding of cities.
When cities are sorted by system length per capita the subplots look very similar to those sorted by stations per capita (the correlation between the two certainly makes sense), so the conclusions are essentially the same.
These subplots also confirm the correlation between number of stations and GDP that we saw before. The correlation between area and stations is worth pointing out as well: cities with more stations can occupy more area, since the MRT connects the city quickly. Again, high-density cities have few stations.
Conclusions drawn from the last set of subplots, sorted by system length, are identical to those sorted by number of stations.
#cities in order from upper subplots
for sorting_column in ["stations_pp", "system_length_pp", "stations","system_length"]:
    print("sorted descending by: " + sorting_column)
    print(con_data.sort_values(sorting_column, ascending=False).reset_index()["city_name"])
sorted descending by: stations_pp 0 Madrid 1 Barcelona 2 London 3 Paris 4 New York 5 Dubai 6 Beijing 7 Washington 8 Chicago 9 Shanghai 10 Philadelphia 11 Moscow 12 Toronto 13 Kuala Lumpur 14 Boston 15 Mexico City 16 Delhi 17 Istanbul 18 Atlanta 19 Bangkok 20 Tokyo 21 Buenos Aires 22 Belo Horizonte 23 Miami 24 Lima 25 Sydney 26 Manila 27 Los Angeles 28 Mumbai 29 Jakarta Name: city_name, dtype: object sorted descending by: system_length_pp 0 Madrid 1 London 2 Beijing 3 Dubai 4 Washington 5 Shanghai 6 Barcelona 7 Moscow 8 New York 9 Paris 10 Chicago 11 Kuala Lumpur 12 Philadelphia 13 Toronto 14 Atlanta 15 Boston 16 Delhi 17 Mexico City 18 Istanbul 19 Tokyo 20 Bangkok 21 Sydney 22 Miami 23 Belo Horizonte 24 Lima 25 Buenos Aires 26 Manila 27 Los Angeles 28 Mumbai 29 Jakarta Name: city_name, dtype: object sorted descending by: stations 0 New York 1 Shanghai 2 Beijing 3 London 4 Paris 5 Tokyo 6 Madrid 7 Moscow 8 Delhi 9 Mexico City 10 Barcelona 11 Chicago 12 Bangkok 13 Istanbul 14 Kuala Lumpur 15 Washington 16 Buenos Aires 17 Philadelphia 18 Toronto 19 Dubai 20 Boston 21 Manila 22 Atlanta 23 Lima 24 Miami 25 Belo Horizonte 26 Los Angeles 27 Sydney 28 Jakarta 29 Mumbai Name: city_name, dtype: object sorted descending by: system_length 0 Shanghai 1 Beijing 2 New York 3 London 4 Moscow 5 Delhi 6 Tokyo 7 Madrid 8 Paris 9 Mexico City 10 Washington 11 Chicago 12 Kuala Lumpur 13 Bangkok 14 Istanbul 15 Barcelona 16 Dubai 17 Philadelphia 18 Atlanta 19 Toronto 20 Boston 21 Buenos Aires 22 Manila 23 Miami 24 Sydney 25 Lima 26 Belo Horizonte 27 Los Angeles 28 Jakarta 29 Mumbai Name: city_name, dtype: object
Now let's look at how columns interact with each other within single cities.
#standardised plot of all the cities
t_standardised = standardised.copy()
t_standardised.index = t_standardised["city_name"]
del t_standardised["city_name"]
t_standardised = t_standardised.transpose()
t_standardised.plot(subplots=True,kind="bar",figsize=(80,30),ylim=(-3,3.5),width=1,sharex=False,layout=(5,6))
plt.subplots_adjust(hspace=0.8)
plt.show()
There is a lot of data here, and many of the possible conclusions have already been pointed out (for instance, cities with positive values for stations per capita always have negative values for congestion ratio).
There are, however, some interesting cities to look at.
London stands out from the rest: it does well in all MRT indicators, has a high GDP and a good rating, and despite a few flaws seems to be doing very well overall. Madrid, on the other hand, also has generally high MRT indicators but a noticeably weaker economy; the same goes for Shanghai and Beijing, which do even worse than Madrid in that respect.
Among US cities New York clearly stands out with the best MRT and the best economy, although most US cities have high GDP per capita despite doing poorly in public transport.
Once again we can see that a city's economy is strongly influenced by the country it is located in.
Let's try separating the cities into clusters to better understand this plot.
#clustering
#create dataframe for clustering
clustering_data = con_data.set_index(con_data["city_name"])
del clustering_data["city_name"]
#Distortion and number of clusters
#copied from https://towardsdatascience.com/k-means-clustering-with-scikit-learn-6b47a369a83c
X = StandardScaler().fit_transform(clustering_data)
distortions = []
for i in range(1, 11):
    km = KMeans(n_clusters=i)
    km.fit(X)
    distortions.append(km.inertia_)
plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()
The elbow plot doesn't clearly indicate how many clusters we should use, so let's try 4.
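When the elbow is ambiguous, the silhouette score is another common criterion for choosing the number of clusters (higher is better). This is a sketch on synthetic, well-separated blobs rather than the city data, just to show the mechanics:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs; the silhouette score should peak at k=3.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0.0, 5.0, 10.0)])
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)
print(max(scores, key=scores.get))  # 3
```

On the notebook's data this would mean scoring `silhouette_score(X, kmeans.labels_)` for each candidate number of clusters.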
#clustering
num_of_clusters=4
X = StandardScaler().fit_transform(clustering_data)
kmeans = KMeans(n_clusters=num_of_clusters)
model = kmeans.fit(X)
clusters = kmeans.predict(X)
clustering_results = clustering_data.copy()
clustering_results["cluster"] = clusters
for x in range(num_of_clusters):
    print("///////////////////////cluster " + str(x) + "///////////////////////")
    print(list(clustering_results[clustering_results["cluster"] == x].index))
///////////////////////cluster 0/////////////////////// ['Barcelona', 'London', 'Madrid', 'Paris'] ///////////////////////cluster 1/////////////////////// ['Atlanta', 'Boston', 'Chicago', 'Dubai', 'Los Angeles', 'Miami', 'Philadelphia', 'Sydney', 'Toronto', 'Washington'] ///////////////////////cluster 2/////////////////////// ['Bangkok', 'Belo Horizonte', 'Buenos Aires', 'Delhi', 'Istanbul', 'Jakarta', 'Kuala Lumpur', 'Lima', 'Manila', 'Mexico City', 'Mumbai'] ///////////////////////cluster 3/////////////////////// ['Beijing', 'Moscow', 'New York', 'Shanghai', 'Tokyo']
#plots for each cluster
single_graph_height=8
t_standardised = standardised.copy()
t_standardised = t_standardised.loc[t_standardised["city_name"].isin(list(clustering_results[clustering_results["cluster"] == 0].index))]
t_standardised.index = t_standardised["city_name"]
del t_standardised["city_name"]
t_standardised = t_standardised.transpose()
t_standardised.plot(subplots=True,kind="bar",ylim=(-3,3.5),
width=1,sharex=False,figsize=(10,single_graph_height*len(list(clustering_results[clustering_results["cluster"] == 0].index))))
plt.subplots_adjust(hspace=0.8)
plt.show()
The first group contains cities that perform well in stations per capita and system length per capita and that have low congestion. Their quality of life, area and density are fairly average. All of them are European cities.
t_standardised = standardised.copy()
t_standardised = t_standardised.loc[t_standardised["city_name"].isin(list(clustering_results[clustering_results["cluster"] == 1].index))]
t_standardised.index = t_standardised["city_name"]
del t_standardised["city_name"]
t_standardised = t_standardised.transpose()
t_standardised.plot(subplots=True,kind="bar",ylim=(-3,3.5),
width=1,sharex=False,figsize=(10,single_graph_height*len(list(clustering_results[clustering_results["cluster"] == 1].index))))
plt.subplots_adjust(hspace=0.8)
plt.show()
In this cluster we have cities that usually have above-average GDP per capita and quality of life, while stations per capita and system length per capita are generally average. The congestion ratio is usually below average. Most of these cities are in the US and Canada.
t_standardised = standardised.copy()
t_standardised = t_standardised.loc[t_standardised["city_name"].isin(list(clustering_results[clustering_results["cluster"] == 2].index))]
t_standardised.index = t_standardised["city_name"]
del t_standardised["city_name"]
t_standardised = t_standardised.transpose()
t_standardised.plot(subplots=True,kind="bar",ylim=(-3,3.5),
width=1,sharex=False,figsize=(10,single_graph_height*len(list(clustering_results[clustering_results["cluster"] == 2].index))))
plt.subplots_adjust(hspace=0.8)
plt.show()
Here we can see cities from Latin America and Asia. They usually do poorly in terms of MRT, though some have many stations per square kilometer. Their congestion ratio is usually high and their economies are typically clearly below average.
t_standardised = standardised.copy()
t_standardised = t_standardised.loc[t_standardised["city_name"].isin(list(clustering_results[clustering_results["cluster"] == 3].index))]
t_standardised.index = t_standardised["city_name"]
del t_standardised["city_name"]
t_standardised = t_standardised.transpose()
t_standardised.plot(subplots=True,kind="bar",ylim=(-3,3.5),
width=1,sharex=False,figsize=(10,single_graph_height*len(list(clustering_results[clustering_results["cluster"] == 3].index))))
plt.subplots_adjust(hspace=0.8)
plt.show()
The last cluster contains the cities with the largest numbers of stations and the longest MRT systems. The remaining columns vary, and there isn't really any other connection between these cities.
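The four plotting cells above differ only in the cluster number, so the duplication could be removed with a small helper. A sketch, using toy frames as stand-ins for the notebook's `standardised` and `clustering_results`:

```python
import pandas as pd

def cluster_frame(standardised, clustering_results, cluster_id):
    """Rows of `standardised` belonging to one cluster, transposed for per-city bar plots."""
    cities = clustering_results[clustering_results["cluster"] == cluster_id].index
    sub = standardised[standardised["city_name"].isin(cities)].set_index("city_name")
    return sub.transpose()

# Toy stand-ins for the notebook's frames.
toy_standardised = pd.DataFrame({"city_name": ["A", "B", "C"], "GDP": [1.0, -0.5, 0.2]})
toy_clusters = pd.DataFrame({"cluster": [0, 1, 0]}, index=["A", "B", "C"])
print(list(cluster_frame(toy_standardised, toy_clusters, 0).columns))  # ['A', 'C']
```

Each of the four cells would then reduce to something like `cluster_frame(standardised, clustering_results, n).plot(subplots=True, kind="bar", ...)`.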
This clustering mostly confirmed what we already knew about stations per capita and system length per capita: clusters where these values are above average tend to have low congestion and high GDP per capita. We can also see that the shape of a city depends heavily on the region it is situated in - geography essentially dominated this clustering.
#boxplot
cluster_A = clustering_data.loc[clustering_results[clustering_results["cluster"] == 0].index]
cluster_B = clustering_data.loc[clustering_results[clustering_results["cluster"] == 1].index]
cluster_C = clustering_data.loc[clustering_results[clustering_results["cluster"] == 2].index]
cluster_D = clustering_data.loc[clustering_results[clustering_results["cluster"] == 3].index]
column_names = list(cluster_A.columns)
fig, axs = plt.subplots(20)
fig.set_size_inches(7, 120)
for x in range(20):
    axs[x].boxplot([cluster_A[column_names[x]],cluster_B[column_names[x]],cluster_C[column_names[x]],cluster_D[column_names[x]]])
    axs[x].set_title(column_names[x])
In the boxplot we can see that the cluster of cities with huge MRT systems has a significantly larger GDP. This cluster contains cities with large populations, yet it is still not the cluster with the most populous cities.
The overall rating seems to be highest for European and American cities, so it appears to have little connection with actual MRT availability.
It is quite similar with pollution, where clusters 1 and 3 are the most polluted.
In fact we can say a lot about these clusters, but most of it is not connected with MRT systems.
On the other hand, the last subplot shows that cities with the worst MRT systems have the highest congestion.
Now let's try to do the same thing, but only with transport data.
#clustering purely on transport data
indicators = ["GDP","Mh_rating","Pollution","Quality_of_life","Crime Rating","population","GDP_pp","area","density","GDP_pkm","congestion ratio"]
clustering_transport_data = con_data.set_index(con_data["city_name"])
for x in indicators:
    del clustering_transport_data[x]
del clustering_transport_data["city_name"]
#Distortion and number of clusters for transport data
#copied from https://towardsdatascience.com/k-means-clustering-with-scikit-learn-6b47a369a83c
X = StandardScaler().fit_transform(clustering_transport_data)
distortions = []
for i in range(1, 11):
    km = KMeans(n_clusters=i)
    km.fit(X)
    distortions.append(km.inertia_)
plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()
2 clusters seem to be a good option here (now we have something that slightly resembles an elbow, which is nice).
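The elbow here is still fairly soft, so a cross-check one could add is the silhouette score, which peaks at the best-separated number of clusters. A minimal standalone sketch on synthetic data (in the notebook, `X` would be the standardised transport matrix instead of the `make_blobs` output):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

#two well-separated synthetic clusters standing in for the standardised city data
X, _ = make_blobs(n_samples=60, centers=[[-5, -5], [5, 5]],
                  cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  #k with the highest silhouette score
```

Unlike distortion, which always decreases as k grows, the silhouette score has a maximum, so it can confirm the choice made by eye from the elbow plot.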
#clustering with 2 clusters
num_of_clusters=2
X = StandardScaler().fit_transform(clustering_transport_data)
kmeans = KMeans(n_clusters=num_of_clusters)
model = kmeans.fit(X)
print("model\n", model)
clusters = kmeans.predict(X)
clustering_transport_results = clustering_transport_data.copy()
clustering_transport_results["cluster"] = clusters
for x in range(2):
    print("///////////////////////cluster " + str(x) + "///////////////////////")
    print(clustering_transport_results[clustering_transport_results["cluster"] == x].index)
model
KMeans(n_clusters=2)
///////////////////////cluster 0///////////////////////
Index(['Barcelona', 'Beijing', 'Delhi', 'London', 'Madrid', 'Mexico City',
'Moscow', 'New York', 'Paris', 'Shanghai', 'Tokyo'],
dtype='object', name='city_name')
///////////////////////cluster 1///////////////////////
Index(['Atlanta', 'Bangkok', 'Belo Horizonte', 'Boston', 'Buenos Aires',
'Chicago', 'Dubai', 'Istanbul', 'Jakarta', 'Kuala Lumpur', 'Lima',
'Los Angeles', 'Manila', 'Miami', 'Mumbai', 'Philadelphia', 'Sydney',
'Toronto', 'Washington'],
dtype='object', name='city_name')
Now the groups are slightly different, so let's see how they look.
cluster_A = clustering_data.loc[clustering_transport_results[clustering_transport_results["cluster"] == 0].index]
cluster_A.describe()
| stations | system_length | ridership | GDP | Mh_rating | Pollution | Quality_of_life | Crime Rating | population | stations_pp | system_length_pp | ridership_pp | GDP_pp | area | density | stations_pkm | system_length_pkm | ridership_pkm | GDP_pkm | congestion ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 1.100000e+01 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 |
| mean | 279.181818 | 387.027273 | 1806.721818 | 676.348182 | 83.053636 | 48.011818 | 44.104545 | 44.879091 | 1.805402e+07 | 20.542967 | 25.691612 | 102.956659 | 41.508669 | 3964.909091 | 5627.272727 | 0.101335 | 0.133943 | 0.530518 | 0.198515 | 38.818182 |
| std | 89.498400 | 199.363162 | 1077.778336 | 456.158646 | 8.946710 | 21.248399 | 16.444324 | 17.725226 | 9.922472e+06 | 11.418928 | 12.980817 | 37.401084 | 21.793528 | 3304.278664 | 2918.592438 | 0.052504 | 0.073777 | 0.219030 | 0.101514 | 12.584261 |
| min | 159.000000 | 122.300000 | 217.930000 | 171.032000 | 74.170000 | 12.010000 | 18.650000 | 13.910000 | 4.588000e+06 | 6.556600 | 8.328725 | 31.269203 | 12.778100 | 1075.000000 | 1800.000000 | 0.029133 | 0.037007 | 0.148617 | 0.118687 | 22.000000 |
| 25% | 235.000000 | 254.200000 | 1189.950000 | 391.005000 | 76.020000 | 30.285000 | 29.785000 | 40.405000 | 1.099950e+07 | 10.918188 | 15.845565 | 85.651145 | 26.449492 | 1905.000000 | 4100.000000 | 0.065181 | 0.082845 | 0.459077 | 0.143998 | 30.000000 |
| 50% | 249.000000 | 347.600000 | 1655.400000 | 553.318000 | 80.000000 | 57.180000 | 47.180000 | 47.610000 | 1.871322e+07 | 17.598930 | 24.064234 | 112.424494 | 37.278117 | 2845.000000 | 4700.000000 | 0.096597 | 0.113767 | 0.526432 | 0.182649 | 37.000000 |
| 75% | 328.500000 | 439.850000 | 2426.675000 | 775.369000 | 87.950000 | 60.460000 | 56.245000 | 52.875000 | 2.155800e+07 | 28.138672 | 35.500051 | 130.283487 | 54.211573 | 4241.000000 | 6000.000000 | 0.129214 | 0.192408 | 0.671118 | 0.196679 | 47.000000 |
| max | 458.000000 | 743.000000 | 4027.300000 | 1616.792000 | 100.000000 | 73.030000 | 69.290000 | 77.860000 | 3.797700e+07 | 40.159310 | 47.875871 | 149.529927 | 76.114218 | 11642.000000 | 12100.000000 | 0.183195 | 0.250863 | 0.836479 | 0.480816 | 59.000000 |
cluster_B = clustering_data.loc[clustering_transport_results[clustering_transport_results["cluster"] == 1].index]
cluster_B.describe()
| stations | system_length | ridership | GDP | Mh_rating | Pollution | Quality_of_life | Crime Rating | population | stations_pp | system_length_pp | ridership_pp | GDP_pp | area | density | stations_pkm | system_length_pkm | ridership_pkm | GDP_pkm | congestion ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 19.000000 | 19.000000 | 19.000000 | 19.000000 | 19.000000 | 19.000000 | 19.000000 | 19.000000 | 1.900000e+01 | 19.000000 | 19.000000 | 19.000000 | 19.000000 | 19.000000 | 19.000000 | 19.000000 | 19.000000 | 19.000000 | 19.000000 | 19.000000 |
| mean | 59.263158 | 76.900000 | 117.526316 | 312.305474 | 80.896316 | 47.738947 | 54.743158 | 50.200000 | 1.130859e+07 | 7.982532 | 11.098252 | 12.581838 | 39.642190 | 3089.368421 | 6042.105263 | 0.025302 | 0.032823 | 0.069206 | 0.128401 | 39.684211 |
| std | 41.284705 | 53.128743 | 110.289804 | 171.774743 | 7.983901 | 20.136592 | 24.933148 | 15.174328 | 8.401567e+06 | 6.616818 | 10.268484 | 11.023851 | 25.512857 | 2040.167973 | 7604.699993 | 0.019503 | 0.025871 | 0.073546 | 0.082273 | 14.817605 |
| min | 12.000000 | 11.400000 | 9.600000 | 84.686000 | 68.310000 | 8.950000 | 13.140000 | 19.360000 | 2.502715e+06 | 0.376375 | 0.454545 | 0.286624 | 7.919352 | 546.000000 | 700.000000 | 0.002540 | 0.004445 | 0.002992 | 0.042975 | 21.000000 |
| 25% | 21.000000 | 35.300000 | 32.850000 | 202.017000 | 75.670000 | 30.680000 | 36.900000 | 38.395000 | 5.404354e+06 | 2.543672 | 3.511359 | 4.737479 | 17.946126 | 1470.000000 | 1600.000000 | 0.008372 | 0.014239 | 0.007220 | 0.082004 | 28.500000 |
| 50% | 51.000000 | 61.000000 | 76.000000 | 294.420000 | 81.740000 | 57.910000 | 62.820000 | 51.430000 | 8.285000e+06 | 6.621352 | 8.156569 | 11.320023 | 40.756367 | 2590.000000 | 2900.000000 | 0.021978 | 0.024125 | 0.053676 | 0.115723 | 35.000000 |
| 75% | 90.500000 | 112.600000 | 142.000000 | 347.588000 | 84.675000 | 62.430000 | 76.970000 | 65.080000 | 1.565550e+07 | 13.183086 | 14.302353 | 15.486104 | 63.391027 | 4277.500000 | 7650.000000 | 0.033182 | 0.045697 | 0.119987 | 0.129504 | 53.000000 |
| max | 145.000000 | 188.000000 | 332.200000 | 860.452000 | 98.440000 | 75.200000 | 85.160000 | 68.910000 | 3.454000e+07 | 22.375700 | 35.761163 | 45.390706 | 82.208008 | 6856.000000 | 32400.000000 | 0.078676 | 0.099779 | 0.216117 | 0.405114 | 71.000000 |
#boxplot
column_names = list(cluster_A.columns)
fig, axs = plt.subplots(20)
fig.set_size_inches(7, 120)
for x in range(20):
    axs[x].boxplot([cluster_A[column_names[x]],cluster_B[column_names[x]]])
    axs[x].set_title(column_names[x])
Now all transport-related columns are higher in the first cluster (cluster A); as we expected, the cities are divided into a cluster with good MRT and one with not-so-good MRT.
GDP is also significantly greater in cluster A, and the overall rating seems higher for these cities. However, quality of life seems to be better in cluster B, and so (slightly) does GDP per capita, which contradicts the previous statements.
Cities in cluster A have less congestion, which we expected.
The other columns seem fairly random.
Now I think we should try clustering using only stations per capita and system length per capita, as those were the two most useful indicators in this study.
#per capita transport clustering
indicators = ["GDP","Mh_rating","Pollution","Quality_of_life","Crime Rating","population","GDP_pp","area","density","GDP_pkm","congestion ratio","stations","system_length","ridership","ridership_pp","stations_pkm","system_length_pkm","ridership_pkm"]
clustering_transport_pp = con_data.set_index(con_data["city_name"])
for x in indicators:
    del clustering_transport_pp[x]
del clustering_transport_pp["city_name"]
#Distortion and number of clusters for transport data
#copied from https://towardsdatascience.com/k-means-clustering-with-scikit-learn-6b47a369a83c
X = StandardScaler().fit_transform(clustering_transport_pp)
distortions = []
for i in range(1, 11):
    km = KMeans(n_clusters=i)
    km.fit(X)
    distortions.append(km.inertia_)
plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()
Again we will use 2 clusters; notice the change in scale, though: now most of the distortion disappears already at 2 clusters.
#clustering with 2 clusters
num_of_clusters=2
X = StandardScaler().fit_transform(clustering_transport_pp)
kmeans = KMeans(n_clusters=num_of_clusters)
model = kmeans.fit(X)
print("model\n", model)
clusters = kmeans.predict(X)
clustering_transport_results = clustering_transport_pp.copy()
clustering_transport_results["cluster"] = clusters
for x in range(2):
    print("///////////////////////cluster " + str(x) + "///////////////////////")
    print(clustering_transport_results[clustering_transport_results["cluster"] == x].index)
model
KMeans(n_clusters=2)
///////////////////////cluster 0///////////////////////
Index(['Barcelona', 'Beijing', 'Chicago', 'Dubai', 'London', 'Madrid',
'Moscow', 'New York', 'Paris', 'Shanghai', 'Washington'],
dtype='object', name='city_name')
///////////////////////cluster 1///////////////////////
Index(['Atlanta', 'Bangkok', 'Belo Horizonte', 'Boston', 'Buenos Aires',
'Delhi', 'Istanbul', 'Jakarta', 'Kuala Lumpur', 'Lima', 'Los Angeles',
'Manila', 'Mexico City', 'Miami', 'Mumbai', 'Philadelphia', 'Sydney',
'Tokyo', 'Toronto'],
dtype='object', name='city_name')
cluster_A = clustering_data.loc[clustering_transport_results[clustering_transport_results["cluster"] == 0].index]
cluster_B = clustering_data.loc[clustering_transport_results[clustering_transport_results["cluster"] == 1].index]
column_names = list(cluster_A.columns)
fig, axs = plt.subplots(20)
fig.set_size_inches(7, 120)
for x in range(20):
    axs[x].boxplot([cluster_A[column_names[x]],cluster_B[column_names[x]]])
    axs[x].set_title(column_names[x])
So in the first cluster we have cities with a high number of stations and a high system length per capita; unsurprisingly, this cluster also does well in the other transport-related columns, so let us see how the remaining columns change.
GDP, overall rating and quality of life all seem to be higher in the first cluster (cluster A).
Pollution is very similar in both clusters.
Cluster A also seems to have less populous cities.
Area doesn't change greatly, but density is definitely smaller in cluster A.
Finally, GDP per capita is significantly larger in cluster A, and the opposite goes for the congestion ratio.
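A quick way to check whether "significantly larger" holds statistically would be a nonparametric test on the two clusters' GDP-per-capita values. A sketch using the Mann-Whitney U test with illustrative numbers in the range of the earlier summary tables (not the notebook's actual per-city values):

```python
from scipy.stats import mannwhitneyu

#hypothetical GDP-per-capita samples for the two clusters (illustrative only)
cluster_hi = [76.1, 54.2, 41.5, 37.3, 26.4, 48.0, 52.0, 60.3, 44.7, 39.9, 55.5]
cluster_lo = [7.9, 17.9, 40.8, 12.6, 21.0, 15.3, 9.8, 18.2, 22.4, 30.1, 11.7]

#one-sided test: is cluster_hi stochastically greater than cluster_lo?
stat, p = mannwhitneyu(cluster_hi, cluster_lo, alternative="greater")
```

With only ~11 and ~19 cities per cluster, a rank-based test like this is more appropriate than eyeballing boxplot medians, since it makes no normality assumption.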
Now let's take a closer look at a dataset containing transport data from London, to see how public transport shaped the city over the years.
#load and prepare London data
lo_gdp = pd.read_csv("london_gdp.csv") #source: https://www.statista.com/statistics/378972/gdp-of-london/
lo_cars = pd.read_csv("cars.csv") #source: https://www.statista.com/statistics/314980/licensed-cars-in-london-england-united-kingdom/
lo_transport = pd.read_csv("tfl_data.csv") #source: https://www.kaggle.com/codebreaker619/london-public-transport-data-by-type-of-transport
lo_pop = pd.read_csv("population_london.csv") #source: https://www.macrotrends.net/cities/22860/london/population
del lo_pop[" Annual Change"]
lo_pop
yearly_data = lo_transport.copy()
del yearly_data['Period and Financial year']
del yearly_data['Reporting Period']
del yearly_data['Period beginning']
del yearly_data['Period ending']
yearly_data = yearly_data.groupby("year").sum()
#sum all journey columns (they all end in "journeys (m)") into a single total
journey_cols = [c for c in yearly_data.columns if c.lower().endswith("journeys (m)")]
yearly_data["All journeys (m)"] = yearly_data[journey_cols].sum(axis=1)
lo_data = yearly_data.merge(lo_gdp,on="year")
lo_data = lo_data.merge(lo_pop,on="year")
lo_data = lo_data.merge(lo_cars,on="year")
lo_data["GDP per capita"] = lo_data["GDP"]/lo_data[" Population"] * 1000000000
lo_data["cars per capita"] = lo_data["cars"]/lo_data[" Population"] * 1000
lo_data["Journeys per capita"] = lo_data["All journeys (m)"]/lo_data[" Population"] *1000000
lo_data["Underground journeys per capita"] = lo_data["Underground journeys (m)"]/lo_data[" Population"] *1000000
lo_data
| year | Days in period | Bus journeys (m) | Underground journeys (m) | DLR Journeys (m) | Tram Journeys (m) | Overground Journeys (m) | Emirates Airline Journeys (m) | TfL Rail Journeys (m) | All journeys (m) | GDP | Population | cars | GDP per capita | cars per capita | Journeys per capita | Underground journeys per capita | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011 | 364 | 2326.6 | 1152.2 | 83.7 | 28.2 | 92.2 | 0.0 | 0.0 | 3682.9 | 351.807 | 8174000 | 2542.7 | 43039.760215 | 0.311072 | 450.562760 | 140.959139 |
| 1 | 2012 | 364 | 2342.0 | 1218.4 | 98.9 | 29.5 | 119.5 | 1.7 | 0.0 | 3810.0 | 367.127 | 8293000 | 2535.5 | 44269.504401 | 0.305740 | 459.423610 | 146.919088 |
| 2 | 2013 | 364 | 2348.1 | 1251.4 | 99.2 | 30.6 | 135.0 | 1.6 | 0.0 | 3865.9 | 387.199 | 8414000 | 2549.3 | 46018.421678 | 0.302983 | 459.460423 | 148.728310 |
| 3 | 2014 | 364 | 2402.3 | 1281.3 | 107.3 | 31.0 | 137.0 | 1.5 | 2.9 | 3963.3 | 413.575 | 8537000 | 2588.4 | 48445.004100 | 0.303198 | 464.249736 | 150.087853 |
| 4 | 2015 | 365 | 2361.3 | 1361.0 | 117.4 | 28.0 | 175.8 | 1.5 | 30.3 | 4075.3 | 429.171 | 8661000 | 2635.9 | 49552.130239 | 0.304341 | 470.534580 | 157.141208 |
| 5 | 2016 | 369 | 2251.0 | 1372.4 | 120.9 | 28.6 | 186.6 | 1.4 | 46.7 | 4007.6 | 453.587 | 8788000 | 2668.2 | 51614.360492 | 0.303619 | 456.030951 | 156.167501 |
| 6 | 2017 | 364 | 2252.9 | 1356.9 | 121.2 | 29.4 | 190.4 | 1.5 | 45.8 | 3998.1 | 464.075 | 8916000 | 2665.9 | 52049.685958 | 0.299002 | 448.418573 | 152.187079 |
| 7 | 2018 | 364 | 2215.5 | 1369.2 | 119.9 | 28.8 | 188.9 | 1.4 | 52.4 | 3976.1 | 483.293 | 9046000 | 2661.2 | 53426.155207 | 0.294185 | 439.542339 | 151.359717 |
| 8 | 2019 | 365 | 2180.5 | 1395.2 | 121.1 | 27.6 | 189.0 | 1.3 | 58.8 | 3973.5 | 503.653 | 9177000 | 2661.0 | 54882.096546 | 0.289964 | 432.984636 | 152.032255 |
So this is the data we have; unfortunately, there isn't much of it.
Let's try plotting it first.
#plot data
lo_data.plot(subplots=True,figsize=(14,30))
The first thing we see is that after around 2016-2017 the use of public transport decreases, and so does the number of cars, which is surprising considering the steady growth of the population. There are a few possible explanations: people might have switched to National Rail, or they may have started cycling more. Another option is that Londoners reduced the number of trips they make (which could have been caused by the series of terrorist incidents in 2017), though this sounds somewhat unlikely.
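The post-2016 decline can be made concrete by looking at year-over-year percentage change. A small standalone sketch using the "All journeys (m)" figures from the table above (in the notebook, `lo_data["All journeys (m)"]` would be used directly):

```python
import pandas as pd

#total TfL journeys in millions, 2011-2019, copied from the merged table
journeys = pd.Series(
    [3682.9, 3810.0, 3865.9, 3963.3, 4075.3, 4007.6, 3998.1, 3976.1, 3973.5],
    index=range(2011, 2020), name="All journeys (m)")

yoy = journeys.pct_change() * 100          #year-over-year change in percent
declining_years = list(yoy[yoy < 0].index) #years where ridership shrank
```

This shows the ridership has fallen every year since 2016, which is the pattern discussed above.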
Let's try looking at correlation coefficients
#calculate correlation
lo_transport_columns = ["Bus journeys (m)","Underground journeys (m)","DLR Journeys (m)",
"Tram Journeys (m)","Overground Journeys (m)","Emirates Airline Journeys (m)",
"TfL Rail Journeys (m)","All journeys (m)","Journeys per capita",
"Underground journeys per capita"]
lo_other_data_columns = list(lo_data.columns)
for x in lo_transport_columns:
lo_other_data_columns.remove(x)
#plot table
lo_data_corr = lo_data.corr()
lo_data_corr.drop(lo_other_data_columns,axis=0,inplace=True)
lo_data_corr.drop(lo_transport_columns,axis=1,inplace=True)
lo_data_corr.drop(["year","Days in period"],axis=1,inplace=True)
lo_data_corr.style.background_gradient(cmap="PiYG",vmin=-1,vmax=1,axis=None)
| GDP | Population | cars | GDP per capita | cars per capita | |
|---|---|---|---|---|---|
| Bus journeys (m) | -0.780036 | -0.801761 | -0.703100 | -0.751571 | 0.754589 |
| Underground journeys (m) | 0.943634 | 0.929473 | 0.941760 | 0.955915 | -0.770623 |
| DLR Journeys (m) | 0.921981 | 0.907316 | 0.936912 | 0.936488 | -0.737137 |
| Tram Journeys (m) | -0.364937 | -0.374180 | -0.437547 | -0.343143 | 0.223868 |
| Overground Journeys (m) | 0.939654 | 0.930863 | 0.956068 | 0.948995 | -0.758852 |
| Emirates Airline Journeys (m) | 0.350390 | 0.348537 | 0.264115 | 0.368905 | -0.404586 |
| TfL Rail Journeys (m) | 0.948629 | 0.948175 | 0.955314 | 0.942979 | -0.778770 |
| All journeys (m) | 0.762985 | 0.735059 | 0.816544 | 0.792843 | -0.547696 |
| Journeys per capita | -0.590758 | -0.626376 | -0.417801 | -0.548254 | 0.713029 |
| Underground journeys per capita | 0.684383 | 0.652240 | 0.776304 | 0.716965 | -0.435699 |
Now we have very extreme coefficients, but our data sample was really small, and while the negative correlations between subway/overall ridership and car ownership seem to make sense before 2016, they can't really be seen after 2017, when usage of all modes of transport decreases.
Even though the coefficients are high, they aren't really convincing, since generally speaking GDP and the public transport network both grow with time regardless of each other.
The only thing we learned is that for some reason people seem to travel less since 2017, which, once again, sounds strange.
Perhaps ridership is not really a good metric to analyse here; instead we could try to see how the GDP of smaller areas in London corresponds to how well public transport connects those areas (especially across decades).
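The "both grow with time" problem can be illustrated directly: correlating year-over-year changes instead of raw levels strips out the shared trend. A sketch using the GDP and Underground figures from the table above:

```python
import pandas as pd

#GDP (billions) and Underground journeys (millions), 2011-2019, from the table
gdp = pd.Series([351.807, 367.127, 387.199, 413.575, 429.171,
                 453.587, 464.075, 483.293, 503.653])
underground = pd.Series([1152.2, 1218.4, 1251.4, 1281.3, 1361.0,
                         1372.4, 1356.9, 1369.2, 1395.2])

level_corr = gdp.corr(underground)               #correlation of raw levels
diff_corr = gdp.diff().corr(underground.diff())  #correlation of yearly changes
```

The level correlation is extreme simply because both series trend upward, while the correlation of the yearly changes is far weaker, supporting the caveat above.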
#NYC
Now let's take a closer look at the NYC dataset and see if we can find something interesting.
#load and clean NYC data
nyc_ridership = pd.read_csv("ridership.csv") #source: https://en.wikipedia.org/wiki/New_York_City_Subway
nyc_pop = pd.read_csv("New York City-population-2021-09-07.csv") #source: https://www.macrotrends.net/cities/23083/new-york-city/population
del nyc_pop['Annual Change']
nyc_pop.columns = ["Year","Population"]
nyc_gdp = pd.read_csv("NGMP35620.csv") #source: https://fred.stlouisfed.org/series/NYNGSP
nyc_gdp.columns = ["Year","GDP"]
air = pd.read_csv("Air_Quality.csv") #source: https://data.cityofnewyork.us/Environment/Air-Quality/c3uy-2p5r/data
#keep only annual, city-wide measurements
annual = air["Time Period"].str.startswith("Annual")
citywide = air["Geo Place Name"] == "New York City"
air = air[annual & citywide]
#annual average fine particles (PM2.5, Indicator ID 365), keyed by year
pm = {}
for x in air.index:
    if air.loc[x,"Indicator ID"] == 365:
        pm[int(air.loc[x,"Time Period"][-4:])] = air.loc[x,"Data Value"]
#annual average NO2 (Indicator ID 375), keyed by year
no = {}
for x in air.index:
    if air.loc[x,"Indicator ID"] == 375:
        no[int(air.loc[x,"Time Period"][-4:])] = air.loc[x,"Data Value"]
no
pm
no_dataframe = pd.DataFrame(no, index = [0])
no_dataframe = no_dataframe.transpose()
no_dataframe = no_dataframe.reset_index()
no_dataframe.columns = ["Year", "NO2"]
no_dataframe = no_dataframe.sort_values("Year")
no_dataframe
pm_dataframe = pd.DataFrame(pm, index = [0])
pm_dataframe = pm_dataframe.transpose()
pm_dataframe = pm_dataframe.reset_index()
pm_dataframe.columns = ["Year", "PM2.5"]
pm_dataframe = pm_dataframe.sort_values("Year")
pm_dataframe
#pollution only data:
nyc_pollution = no_dataframe.merge(pm_dataframe, on = "Year")
nyc_pollution
| Year | NO2 | PM2.5 | |
|---|---|---|---|
| 0 | 2009 | 22.720 | 10.4300 |
| 1 | 2010 | 21.110 | 9.5400 |
| 2 | 2011 | 21.580 | 10.1000 |
| 3 | 2012 | 19.580 | 8.9500 |
| 4 | 2013 | 18.830 | 8.6300 |
| 5 | 2014 | 18.890 | 8.9300 |
| 6 | 2015 | 18.396 | 8.5604 |
| 7 | 2016 | 17.580 | 7.4500 |
| 8 | 2017 | 17.770 | 7.3800 |
| 9 | 2018 | 17.000 | 7.0000 |
nyc_data = nyc_ridership.merge(nyc_pop, on="Year")
nyc_data = nyc_data.merge(nyc_gdp, on="Year")
nyc_data["Subway ridership per capita"] = nyc_data["Subway ridership"]/nyc_data["Population"]
nyc_data["GDP per capita"] = nyc_data["GDP"]/nyc_data["Population"] * 1000000
nyc_data = nyc_data.merge(nyc_pollution,on="Year")
nyc_data
| Year | Subway ridership | Population | GDP | Subway ridership per capita | GDP per capita | NO2 | PM2.5 | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2010 | 1605000000 | 18365000 | 1284183.559 | 87.394500 | 69925.595372 | 21.110 | 9.5400 |
| 1 | 2011 | 1640000000 | 18421000 | 1309603.152 | 89.028826 | 71092.945660 | 21.580 | 10.1000 |
| 2 | 2012 | 1654000000 | 18478000 | 1401401.443 | 89.511852 | 75841.619385 | 19.580 | 8.9500 |
| 3 | 2013 | 1708000000 | 18534000 | 1447038.040 | 92.154958 | 78074.783641 | 18.830 | 8.6300 |
| 4 | 2014 | 1751287621 | 18591000 | 1509213.361 | 94.200829 | 81179.783820 | 18.890 | 8.9300 |
| 5 | 2015 | 1762565419 | 18648000 | 1576387.303 | 94.517665 | 84533.853657 | 18.396 | 8.5604 |
| 6 | 2016 | 1756814800 | 18705000 | 1640015.703 | 93.922203 | 87677.931195 | 17.580 | 7.4500 |
| 7 | 2017 | 1727366607 | 18762000 | 1699894.281 | 92.067296 | 90603.042373 | 17.770 | 7.3800 |
| 8 | 2018 | 1680060402 | 18819000 | 1799148.324 | 89.274691 | 95602.759126 | 17.000 | 7.0000 |
#NYC correlations
nyc_data_corr = nyc_data.corr()
nyc_data_corr.drop(["Year","Population","GDP","GDP per capita","NO2","PM2.5"],axis=0,inplace=True)
nyc_data_corr.drop(["Year","Subway ridership","Subway ridership per capita"],axis=1,inplace=True)
nyc_data_corr.style.background_gradient(cmap="PiYG",vmin=-1,vmax=1,axis=None)
| Population | GDP | GDP per capita | NO2 | PM2.5 | |
|---|---|---|---|---|---|
| Subway ridership | 0.660815 | 0.615062 | 0.616420 | -0.710917 | -0.527833 |
| Subway ridership per capita | 0.483936 | 0.431600 | 0.433247 | -0.557802 | -0.345893 |
So once again we have little data, and despite the extreme coefficients we cannot really draw conclusions, especially after looking at the subplots, which do not show these correlations:
#NYC plot
nyc_data.plot(subplots=True,figsize=(14,30))
However, once again the usage of public transport peaks around 2016 and then drops. This is hard to explain, as it doesn't really seem that the subway is getting any less crowded, yet the numbers are undeniable. It would be really interesting to find out what is causing this decrease
(or maybe we are not looking where we should, and there was a special reason for the peak).
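Before concluding, the small-sample caveat about the NYC coefficients can be made concrete by attaching a p-value to one of them. With only nine yearly observations, even a strong-looking correlation carries wide uncertainty. A sketch using the ridership and NO2 columns copied from the `nyc_data` table:

```python
from scipy.stats import pearsonr

#NYC subway ridership (millions) and NO2, 2010-2018, from the table above
ridership = [1605, 1640, 1654, 1708, 1751.29, 1762.57, 1756.81, 1727.37, 1680.06]
no2 = [21.110, 21.580, 19.580, 18.830, 18.890, 18.396, 17.580, 17.770, 17.000]

r, p = pearsonr(ridership, no2)  #r matches the table's coefficient of about -0.71
```

Even for a coefficient of about -0.71, with n = 9 the p-value is only barely below the usual 0.05 threshold, so these correlations should be treated with caution.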
First of all, this study wasn't revolutionary, as everything we learned had already been established; still, it is nice that the study confirms our existing knowledge rather than contradicting it.
What we learned in terms of our main topic is that MRT systems benefit our cities, most visibly through higher GDP per capita and lower traffic congestion.
These findings make sense, as the availability of public transport gives people more employment options: they can easily commute to their workplace even if they do not own a car. The possibility of using public transport also means that people do not need to live in the city center; the population can spread out and still reach the center in a short amount of time. Finally, not everyone needs to own a car when an alternative exists, and this, I think, protects cities with good MRT from extreme traffic congestion.
What we did not find was a connection between MRT and pollution.
We also learned that the best way to measure public transport performance seems to be the number of stations and system length per capita, although these two factors may not be enough, as the study included cities that used their stations and tracks more efficiently than others.
Finally, there are cities with low congestion and high GDP per capita that do not really have a great MRT system, so public transport is not the only way to improve the economy and traffic.
Nevertheless, I would highly advise investing in public transport, since we have seen that a good MRT system consistently benefits the population.